Mobile Web Scraping

ABSTRACT

Methods, systems and computer program products implementing data aggregation using distributed Web scraping are disclosed. A mobile device can scrape one or more target sites to collect data from accounts of a particular user. The scraping can occur under scraping conditions as specified by the user. The scraping conditions can include conditions based on time, power, bandwidth, usage, or any combination of the above. The scraping conditions can ensure that the scraping occurs at a time that is most convenient to the user, e.g., when sufficient bandwidth is available to the mobile device or the mobile device is not performing other tasks. The mobile device can upload the scraped data to a data aggregation server under submission conditions as specified by the user. The data aggregation server can aggregate the scraped data, enrich the aggregated data, and provide the enriched data to the user through Web access.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a divisional application of and claims priority to U.S. application Ser. No. 15/624,578, filed on Jun. 15, 2017, the disclosure of which is incorporated by reference.

TECHNICAL FIELD

This disclosure relates generally to transaction data processing.

BACKGROUND

User account data, or simply account data, can include data describing transactions between service providers and customers. The service providers can include, for example, Web publishers, hospitals, online merchants, or financial institutions. The customers can include, respectively, client computers, patients, shoppers, or bank customers. A data aggregation server can gather the account data and enrich the account data for data analyzers, e.g., research institutes for studying content download patterns, health trends, shopping trends, and bank service demand. Enriching the account data can include, for example, organizing the account data into categories, filtering the account data, and correcting misspelled terms. The data mining server can obtain the account data by scraping sites of the service providers. Data scraping can include automatically logging into a site by a data mining server and extracting data from the site.

SUMMARY

Techniques of data aggregation using distributed data scraping are disclosed. A mobile device can scrape one or more target sites to collect data from accounts of a particular user. The scraping can occur under scraping conditions as specified by the user. The scraping conditions can include conditions based on time, power, bandwidth, usage, or any combination of the above. The scraping conditions can ensure that the scraping occurs at a time that is most convenient to the user, e.g., when sufficient bandwidth is available to the mobile device or the mobile device is not performing other tasks. The mobile device can upload the scraped data to a data aggregation server under submission conditions as specified by the user. The data aggregation server can aggregate the scraped data, enrich the aggregated data, and provide the enriched data to the user through Web access.

The features described in this specification can be implemented to achieve one or more advantages over conventional data scraping techniques. For example, the disclosed techniques are more flexible than conventional data scraping where the data gathering is performed at a data mining server. Distributed data scraping does not require the server to store user credentials for accessing target sites, thereby enhancing security. Some target sites may have enhanced security features that prevent server login and only permit login from registered mobile devices. These security features can defeat conventional data scraping techniques where scraping is originated from a server. The disclosed techniques allow data scraping from the registered mobile devices, and are suitable to be implemented under such enhanced security features of the target sites.

Compared to conventional techniques, the disclosed techniques are more convenient to the user. The data that the user requests can be downloaded to a mobile device of a user directly, instead of or in addition to through a server. The direct download time can be scheduled by the user according to the user's needs, and is not constrained by scraping schedules of the server.

The disclosed techniques are more fault tolerant. By bypassing the server, the direct download reduces latency and avoids interruptions caused by server down time. Using mobile devices, e.g., users' smartphones as data gatherers, a server's needs for bandwidth and processing power can be reduced. The reduced needs can reduce maintenance cost of a data analyzer for operating the server.

The details of one or more implementations of the disclosed subject matter are set forth in the accompanying drawings and the description below. Other features, aspects and advantages of the disclosed subject matter will become apparent from the description, the drawings and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram illustrating an example information extraction system.

FIG. 2 is a block diagram illustrating example communication channels from a mobile device and a data aggregation server.

FIG. 3 is a block diagram illustrating an example flow in distributed data scraping.

FIG. 4 is a block diagram illustrating functional blocks of an example information extraction system.

FIG. 5 is a block diagram of a conventional information extraction system for reference.

FIGS. 6A-6D illustrate example user interfaces of an information extraction system.

FIG. 7 is an example configuration user interface of an information extraction system.

FIG. 8 is a flowchart illustrating a first example process of information extraction by a mobile device.

FIG. 9 is a flowchart illustrating a second example process of information extraction by a mobile device.

FIG. 10 is a block diagram illustrating an example device architecture of a mobile device implementing the features and operations described in reference to FIGS. 1-9.

FIG. 11 is a block diagram of an example network operating environment for the mobile devices of FIGS. 1-9.

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1 is a block diagram illustrating an example information extraction system 100. The information extraction system 100 includes a mobile device 102 that includes one or more processors programmed to perform data collection operations. The mobile device 102 is configured to gather account data through site scraping. The mobile device 102 can be a smartphone, a tablet computer, a laptop computer, a wearable device or various devices that have wired or wireless communication features and are operated by a user.

The mobile device 102 can receive a synchronization notification 104 from a program executing on the mobile device 102 or from a data aggregation server 106. The data aggregation server 106 can include one or more computers of a data enrichment and analysis platform, e.g., a Yodlee® data platform. The synchronization notification 104 can be a message, or an event, that indicates that the mobile device 102 shall collect account data from one or more target sites 108.

An application executing on the mobile device 102 can submit the synchronization notification 104 automatically, under certain system-specified or user-specified schedules. For example, the application can be configured to submit the synchronization notification 104 on a system-specified or user-specified schedule, e.g., at hour X every day, Y minutes after all activities including motions and uses cease at the mobile device 102, immediately after each financial transaction, among various other schedules. The application can submit the synchronization notification 104 according to the schedule upon determining that, at the time scheduled to collect the account data, certain system-specified or user-specified conditions are satisfied, for example, when Wi-Fi™ connections are available, when battery level is above a threshold, etc.
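
The schedule-plus-conditions check described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names DeviceState, SyncPolicy, and should_submit_sync, the default threshold, and the idle-time field are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class DeviceState:
    """Snapshot of device conditions (hypothetical fields)."""
    wifi_available: bool
    battery_percent: int
    idle_minutes: int


@dataclass
class SyncPolicy:
    """System- or user-specified schedule and conditions (hypothetical)."""
    window_start: time
    window_end: time
    min_battery_percent: int = 50   # assumed default threshold
    require_wifi: bool = True
    min_idle_minutes: int = 0


def should_submit_sync(now: datetime, state: DeviceState, policy: SyncPolicy) -> bool:
    """Return True when the schedule and every condition are satisfied."""
    # Schedule check; a window that crosses midnight is not handled here.
    if not (policy.window_start <= now.time() < policy.window_end):
        return False
    if policy.require_wifi and not state.wifi_available:
        return False
    # The battery level must be above the threshold, not merely equal to it.
    if state.battery_percent <= policy.min_battery_percent:
        return False
    return state.idle_minutes >= policy.min_idle_minutes
```

In this sketch the schedule is tested first so that the device conditions only matter inside the scheduled window.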

The synchronization notification 104 can include instructions for scraping various account data. The account data can include data describing transactions and, more generally, data describing status of one or more accounts of a user. For example, account data on a user's health care account can include transactional information about, for example, a time the user received a particular prescription, a name of the prescription, a quantity of the prescription, a cost of the prescription, and additionally or alternatively, status data, e.g., a history of past prescriptions. Likewise, account data on a user's financial account can include transaction records of a user's deposits, withdrawals, purchases, and additionally or alternatively, status data including current balances, monthly minimum and maximum balances, etc.

In some implementations, the mobile device 102 can receive the synchronization notification 104 from the data aggregation server 106. The data aggregation server 106 can include one or more computers programmed to generate a report using data from the target sites 108. The report can be an aggregated report, e.g., a monthly status report, of transaction data from multiple target sites 108. For example, the report can include an aggregated prescription report from multiple physicians, e.g., a family doctor, a gastroenterologist and a cardiologist of a same patient, or an aggregated financial statement from a bank account, an investment account and a credit card account of a customer. The patient or the customer can specify content of the report through a user interface provided by the data aggregation server 106. The report can include enriched data. For example, the report can aggregate a payment card account and a bank account configured to automatically pay the balance of the payment card account. The data aggregation server 106 can aggregate the bank account data and the payment card account data and predict a bank account balance on a future date.
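
As a hedged illustration of the enrichment example above, the following sketch predicts a bank balance on a future date when a payment card balance is paid automatically from the bank account. The function name, its parameters, and the single-payment model are assumptions made for the example; the disclosure does not specify how the prediction is computed.

```python
from datetime import date
from decimal import Decimal


def predict_bank_balance(bank_balance: Decimal, card_balance_due: Decimal,
                         autopay_date: date, on_date: date) -> Decimal:
    """Predict the bank balance on on_date, assuming one scheduled autopay."""
    # If the automatic payment falls on or before the date of interest,
    # the card balance will already have been deducted from the bank account.
    if autopay_date <= on_date:
        return bank_balance - card_balance_due
    return bank_balance
```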

To generate the report, the data aggregation server 106 collects data from various sources. Collecting the data includes scraping account data from the target sites 108. The data aggregation server 106 can delegate the data scraping tasks to the mobile device 102. The delegation can allow the data aggregation server 106 to distribute data collection jobs for multiple users to multiple mobile devices instead of performing the scraping tasks on the data aggregation server 106. The delegation de-centralizes data collection to multiple user devices including the mobile device 102.

The target sites 108 can include Web sites or other types of data download sites, e.g., FTP (file transfer protocol) sites or mail servers. The target sites 108 correspond to one or more service providers. The service providers can include health care providers, schools, financial institutions, among others. The target sites 108 may implement various security features, e.g., measures that prevent access from devices other than a device from which a user registered with the target sites 108. These security features can prevent the data aggregation server 106 from accessing the target sites 108 directly. The decentralized data collection can allow the data aggregation server 106 to collect data through registered devices, e.g., the mobile device 102. Accordingly, the data aggregation server 106 can generate the requested report without resorting to requesting the target sites 108 to open security backdoors for the data aggregation server 106.

Collecting 110 account data from the target sites 108 can include logging into each of the target sites 108 using user credentials stored on the mobile device 102. The mobile device 102 can then navigate the pages of the target sites 108, parse the pages, and retrieve the account data using various scraping techniques.

Upon receiving the scraped data, the mobile device 102 can perform various updating operations. The updating operations can include submitting a refresh request 112 to the data aggregation server 106. The refresh request 112 notifies the data aggregation server 106 that scraped data is available to the data aggregation server 106. The data aggregation server 106, in response, can perform actions to retrieve the data collected by the mobile device 102. For example, the data aggregation server 106 can launch (114) a gatherer 116 and an agent 118 for collecting the scraped data. The gatherer 116 can include an agent manager process that spawns one or more agents 118. Each agent 118 can correspond to a respective target site 108. Different target sites 108 can correspond to different agents 118. The gatherer 116 and an agent 118 are computer programs configured to be executed by one or more processors and cause the one or more processors to gather collected account data from the mobile device 102.

The updating operations performed by the mobile device 102 can include preparing the scraped account data for the data aggregation server 106. Preparing the scraped data can include wrapping the scraped data in a standard format, e.g., the XML (extensible markup language) format for storage and for transmission. In some implementations, the mobile device 102 can provide formatted data 120 to an intermediate storage server (ISS) 122 for temporary or permanent storage. The ISS 122 is an optional component of the information extraction system 100 including one or more non-transitory storage devices. In some implementations, the ISS 122 can be implemented using representational state transfer (REST) Web services.
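
Wrapping scraped fields in XML, as described above, might look like the following sketch. The element and attribute names (scrapedData, field, site, name) are hypothetical; the disclosure names XML as an example format but does not define a schema.

```python
import xml.etree.ElementTree as ET


def wrap_scraped_data(site: str, fields: dict) -> str:
    """Serialize scraped fields for one target site as an XML string."""
    # Element and attribute names are illustrative assumptions only.
    root = ET.Element("scrapedData", attrib={"site": site})
    for name, value in fields.items():
        item = ET.SubElement(root, "field", attrib={"name": name})
        item.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

A receiver can parse the string back with ET.fromstring and read each field by its name attribute.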

The agent 118, after being launched, can retrieve the formatted data 120 from the mobile device 102 or from the ISS 122. The agent 118 can normalize the retrieved data, and provide the normalized data to the gatherer 116. The gatherer 116 can provide (124) the data gathered from one or more agents 118 to the data aggregation server 106.
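
The gatherer and agent roles described above can be sketched as below, with one agent per target site. The class names, the fetch callback, and the particular normalization (lower-casing keys and tagging each record with its site) are assumptions for illustration; the disclosure does not specify the normalization rules.

```python
from typing import Callable, Dict, List


class Agent:
    """Retrieves formatted data for one target site and normalizes it."""

    def __init__(self, site: str, fetch: Callable[[str], List[dict]]):
        self.site = site
        self._fetch = fetch  # hypothetical reader for the device or the ISS

    def collect(self) -> List[dict]:
        records = self._fetch(self.site)
        # Assumed normalization: lower-case the keys and tag the source site.
        return [
            {**{key.lower(): value for key, value in record.items()}, "site": self.site}
            for record in records
        ]


class Gatherer:
    """Agent manager that spawns one agent per target site."""

    def __init__(self, fetch: Callable[[str], List[dict]]):
        self._fetch = fetch

    def gather(self, sites: List[str]) -> Dict[str, List[dict]]:
        agents = [Agent(site, self._fetch) for site in sites]
        return {agent.site: agent.collect() for agent in agents}
```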

In some implementations, the data aggregation server 106 and the mobile device 102 can communicate through a communication channel 124. The mobile device 102 can read various data, including scraping instructions, categorized data and aggregated reports, from the data aggregation server 106 through the communication channel 124.

The data aggregation server 106 can include, or be coupled with, a data store 126. The data store 126 can include a non-transitory computer-readable medium storing aggregated data, e.g., data scraped by mobile device 102 over a period of time and other relevant data, e.g., metadata and data scraped by the data aggregation server 106. The data store 126 can store particular user credentials for accessing the target sites 108. The data aggregation server 106 can use the stored user credentials to scrape account data from those target sites 108 that are not scraped by the mobile device 102 but are requested by the user nonetheless. The mobile device 102 can read the aggregated data, and reports generated from the aggregated data, using an interactive API through the communication channel 124. The mobile device 102 can present the aggregated data on a display surface of the mobile device 102, or store the aggregated data for later use. In various implementations, a client device other than the mobile device 102, e.g., a desktop computer, can access and store or present the aggregated data and the reports.

FIG. 2 is a block diagram illustrating example communication channels from a mobile device 102 and a data aggregation server 106. The mobile device 102 includes one or more processors configured to execute one or more application programs. For convenience, an account management application is described. Various applications, e.g., healthcare record management applications, inventory management applications, or chain store business management applications, can be implemented similarly. A scraping mobile application 202 is configured to collect account data by scraping various target sites, e.g., Web sites of service providers, and to provide account management functions, e.g., fund transfer, payment scheduling, direct deposit setup, statement summary, among others.

The mobile device 102 installs a data collection library 204 that is used by the application 202. The data collection library 204 includes computer instructions for configuring synchronization settings and service provider settings. The data collection library 204 can be a native library, e.g., an Android or iOS library, given the native device dependencies. The data collection library 204 provides a settings interface for a scraping scheduler. The scraping scheduler is configured to receive, through the settings interface, user input for synchronization settings including, for example, a time preference for scraping the data and whether to scrape data over a Wi-Fi connection, a mobile data connection, or both. The data collection library 204 can store a local protected copy of user credentials for accessing one or more target sites, e.g., Web sites of service providers. In some implementations, the data collection library 204 can include computer instructions that, when executed, cause the mobile device 102 to store credentials for accessing an external data gathering service that is configured to scrape data from the target sites. The external data gathering service can be a cloud-based service that a user already set up for collecting data.

The data collection library 204 implements functions that interact with a data aggregation server 106 via an API 206. In some implementations, the API 206 can be a centralized REST API. The data collection library 204 implements a site scraping agent, e.g., the agent 118 of FIG. 1. The library 204 can include functions for implementing a gatherer, e.g., the gatherer 116 of FIG. 1, that is based on the centralized REST API. The mobile device 102 can store the scraped data to the gatherer.

In some implementations, various functions implemented by the data collection library 204 can cause the mobile device 102 to communicate with the data aggregation server 106 through one or more intermediary services 208, e.g., to read data that has been processed by the data aggregation server 106. The intermediary services 208 include components configured to process transactions between the mobile device 102 and the data aggregation server 106 and process scraped account data. The intermediary services 208 can include an ISS. The mobile device 102 can store the scraped data in the ISS. A corresponding client agent can poll and read the data from the ISS. Each client agent corresponds to a respective service provider.

The intermediary services 208 can include a gatherer server 210. In some implementations, the gatherer server 210 can include the ISS. The gatherer server 210 can include one or more computers configured to provide site scraping scripts to the mobile device 102, or receive site scraping scripts from the mobile device 102. The gatherer server 210 can receive and store scraped data, e.g., one or more page dumps, from the mobile device 102.

The mobile device 102 launches the application 202 according to the synchronization settings, e.g., a schedule specified by a user that specifies that the application 202 shall be launched, for example, every 24 hours, and, in particular, after wakeup or before sleep for all service provider sites. Each of the wakeup and sleep corresponds to a state of the mobile device 102 where the mobile device 102 has recognized a particular usage pattern indicating that a user of the mobile device 102 is likely to start or stop using the mobile device 102. The mobile device 102 can open a communication session, e.g., a WebView session, in which navigation and scraping are performed. The communication session can be an automatic and background session without having to use a user interface such as a browser. The mobile device 102 navigates to the respective target sites and relevant pages as specified for each target site and collects the data from the pages. The pages can include, for example, home pages, monthly summary pages, and statement pages.

The mobile device 102 can scrape data from the pages, e.g., by taking a page dump or by parsing various sections of the pages to retrieve specific data fields. The mobile device 102 submits the scraped data to the gatherer server 210 for processing. The mobile device 102 can perform the scraping and submission operations in rapid succession from one site to another. The gatherer server 210 can include various tools, for example, a Web automation and testing tool (e.g., Sahi), that are configured to process the scraped data. The gatherer server 210 can provide the processed data to the data aggregation server 106.

The mobile device 102 can implement various features that allow the scraping to be fast and not to affect execution and usability of other applications of the mobile device 102. The scraping can be configured to incur minimal usage of resources of the mobile device 102, e.g., battery power. The scraping minimizes the overall battery usage in various ways. For example, the scraping can be programmed to occur only when external power is available, e.g., when the mobile device 102 is plugged into a charging port or docked. Scraping can be configured to execute only when the battery level is above a threshold, e.g., above X percent. The scraping can be configured to incur minimal usage of CPU, memory, and network resources. For example, the mobile device 102 can be configured to minimize network usage as much as possible by using compression wherever necessary. Network synchronization can be maintained in such a way that there are no “denial of service” errors from the service provider's servers, e.g., the target sites 108 of FIG. 1, or from the data aggregator's servers, e.g., the data aggregation server 106.
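
The power gating and compression described above can be sketched as follows. The function names, the 30 percent default threshold, and the 256-byte minimum size are assumptions chosen for the example; zlib stands in for whatever compression scheme an implementation might use.

```python
import zlib
from typing import Tuple


def can_scrape(on_external_power: bool, battery_percent: int,
               threshold_percent: int = 30) -> bool:
    """Allow scraping on external power or when the battery is above a threshold."""
    return on_external_power or battery_percent > threshold_percent


def prepare_payload(page_dump: str, min_size: int = 256) -> Tuple[bytes, bool]:
    """Return (payload, compressed), compressing only when it saves bytes."""
    raw = page_dump.encode("utf-8")
    if len(raw) < min_size:
        # Tiny payloads rarely shrink, so send them as they are.
        return raw, False
    compressed = zlib.compress(raw, 6)
    if len(compressed) < len(raw):
        return compressed, True
    return raw, False
```

Returning the flag alongside the payload lets the receiver decide whether to call zlib.decompress.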

The intermediary services 208 can include a client aggregation library service 212. The client aggregation library service 212 can include a shim library configured to interact with the application 202 to provide functions including, for example, user behavior analysis, user settings management, among others. The client aggregation library service 212 can interact with the data aggregation server 106, including obtaining various aggregation requests. The mobile device 102 keeps track of user behavior, usage and operating environment of the mobile device 102. The user behavior, usage and operation environment can include, for example, network availability, e.g., whether Wi-Fi connection or mobile connection is available; whether the mobile device 102 is docked; screen on/off period; phone charging patterns (time of the day), among others. The mobile device 102 can submit the tracked information to the client aggregation library service 212 for analysis. The client aggregation library service 212 can determine, based on the tracked information, a pattern that corresponds to a wakeup state and a pattern that corresponds to a sleep state. The wakeup state and sleep state may be different on different mobile devices, and can be particular to each user of a corresponding mobile device.
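
One simple way to derive wakeup and sleep patterns from tracked screen-on events is sketched below. The heuristic (median of the first and last screen-on time per day) and the function name are assumptions for illustration; the disclosure does not state how the patterns are determined.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Iterable, Tuple


def estimate_wake_sleep(screen_on_events: Iterable[datetime]) -> Tuple[float, float]:
    """Estimate typical wakeup and sleep times as fractional hours of the day."""
    by_day = defaultdict(list)
    for event in screen_on_events:
        by_day[event.date()].append(event.hour + event.minute / 60)
    if not by_day:
        raise ValueError("no screen-on events to analyze")
    # Assumed heuristic: the first screen-on of a day approximates wakeup and
    # the last approximates sleep; the median damps unusual days.
    wakeup = median(min(hours) for hours in by_day.values())
    sleep = median(max(hours) for hours in by_day.values())
    return wakeup, sleep
```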

The data collection library 204 tries to minimize cache refresh failures. In client side aggregation, cache refreshes can fail when the user does not wish the aggregation to happen. The data collection library 204 adopts multiple strategies to minimize the failures. For example, the data collection library 204 can implement multiple operating modes to handle refreshes. The operating modes include, for example, a user specified timing mode, an automated timing mode, and a mixed mode.

In the user specified timing mode, the mobile device 102, or the client aggregation library service 212, determines a user pattern of synchronizing with the data aggregation server 106 at one or more time slots. The one or more time slots can include pre-defined time slots that a user sets, e.g., between 1 am and 2 am. The user-set time slots can be specified in a database or a configuration file. Additionally, settings for the time slots can be associated with activities, or a combination of activities and time windows. Settings associated with activities can specify, for example, that data collection shall occur after wakeup or before sleep. These example settings can indicate that scraping is preferably performed at the beginning of the day or end of the day for the user. The user specified timing mode has the benefit of predictable data collection. This operating mode also establishes a user behavior pattern towards the data aggregation server 106. This operating mode is comparable to a person checking email or messages upon waking up.

The mobile device 102 may prevent the scraping from happening at the specified time when a high priority activity on the mobile device 102, e.g., a phone call, is ongoing. This feature would be more effective in terms of usability if, for example, a personal financial manager program is configured to perform a real-time update of the user's data after scraping. The real-time update can encourage a user to select the user specified timing mode.

In the automated timing mode, the mobile device 102 automatically determines specific periods when scraping can happen for the target sites without requiring user input. For example, the mobile device 102 can determine that data collection shall occur when the mobile device 102 is docked for charging and when a screen of the mobile device is on, or when a user just finished a phone call, the screen is on and no further user interaction with the mobile device 102 is detected. In the automated timing mode, the application 202 using the library 204 can keep track of various device activities that may indicate user behavior. The device activities can include, for example, frequency of screen on and off, network availability, charging pattern, among others.
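
The two example triggers in the automated timing mode can be expressed as a small predicate. The ActivitySnapshot fields and the 60-second idle cutoff are hypothetical values chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ActivitySnapshot:
    """Tracked device activity (hypothetical fields)."""
    docked_for_charging: bool
    screen_on: bool
    call_just_ended: bool
    seconds_since_interaction: float


def automated_window_open(activity: ActivitySnapshot, idle_seconds: float = 60.0) -> bool:
    """Return True when either example trigger from the text is met."""
    # Trigger 1: docked for charging while the screen is on.
    if activity.docked_for_charging and activity.screen_on:
        return True
    # Trigger 2: a call just ended, the screen is on, and no further
    # interaction has been detected for an assumed idle period.
    return (activity.call_just_ended and activity.screen_on
            and activity.seconds_since_interaction >= idle_seconds)
```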

While the mobile device 102 submits collected data to the data aggregation server 106, the mobile device 102 gives out an indication that synchronization is happening. The indication can include a notification. The mobile device 102 can keep the screen of mobile device 102 unchanged during the synchronization. Accordingly, the synchronization can be automated without user intervention, at least for some devices, e.g., Android™ smartphones. The automated timing mode can potentially minimize cache refresh failures. The automated timing mode is based on understanding of usage behavior of the mobile device 102. In the automated timing mode, the mobile device 102 may launch the application 202 at irregular intervals. To avoid interfering with the user's activities, the mobile device 102 may interrupt the data scraping in response to a user input, e.g., a setting that specifies that the application 202 shall not be launched when certain activities are ongoing. In some implementations, e.g., on iOS™ smartphones, a user can initiate the synchronization.

In the mixed mode, the mobile device 102 uses a combination of user specified timing and automated timing to improve the overall success rate of cache refreshes. In the mixed mode, the mobile device 102 attempts to collect account data from target sites according to user specified timing. If the scraping of the target sites was not successful, e.g., was not completely finished during user specified timing periods, the mobile device 102 can re-attempt the collection according to automated timing logic. For example, the mobile device 102 can enable scraping at a pre-defined time and scrape whenever possible. In some implementations, the mobile device 102 can enable scraping in response to user input.
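
The mixed mode fallback can be sketched as two passes over the target sites. The function names and the callback signatures are assumptions; the sketch presumes the caller invokes run_mixed_mode during a user specified timing period.

```python
from typing import Callable, Iterable, List


def scrape_pass(sites: Iterable[str], scrape: Callable[[str], bool]) -> List[str]:
    """Try each site once and return the sites whose scrape did not succeed."""
    return [site for site in sites if not scrape(site)]


def run_mixed_mode(sites: Iterable[str], scrape: Callable[[str], bool],
                   automated_window_open: Callable[[], bool]) -> List[str]:
    """Scrape under user specified timing, then retry under automated timing."""
    # First pass: assumed to run during the user specified timing period.
    pending = scrape_pass(sites, scrape)
    # Second pass: re-attempt unfinished sites if automated timing allows it.
    if pending and automated_window_open():
        pending = scrape_pass(pending, scrape)
    return pending
```

The returned list holds the sites that remain unscraped, so a caller can schedule a later attempt for them.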

The mobile device 102 can determine a total number of target sites to be scraped, and whether the total number exceeds a threshold. Upon determining that the total number exceeds the threshold, the mobile device 102 determines if scraping these target sites will take more than a pre-specified time. In response to determining that the scraping will take more than the specified time, the mobile device 102 keeps track of the user behavior to understand when to collect the data to avoid time periods during which the user may be using the mobile device 102. Operating in the mixed mode may have an overall improvement of success rate in cache refreshes.
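
The two-step decision above (site count first, then estimated duration) can be sketched as follows. Estimating duration as a fixed number of seconds per site is an assumption made for the example, as are the function and parameter names.

```python
def needs_behavior_tracking(num_sites: int, site_threshold: int,
                            est_seconds_per_site: float,
                            max_seconds: float) -> bool:
    """Decide whether to track user behavior before scheduling the scrape."""
    # Step 1: only a site count above the threshold triggers the time estimate.
    if num_sites <= site_threshold:
        return False
    # Step 2: compare the estimated total duration with the pre-specified time.
    return num_sites * est_seconds_per_site > max_seconds
```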

To use the automated timing mode and the mixed mode, the mobile device 102 or the client aggregation library service 212 can analyze user activity patterns, e.g., a user's phone usage frequency, time and duration. In some implementations, the client aggregation library service 212 can collect usage information submitted by the mobile device 102 and perform the analysis. For example, the mobile device 102 can provide a call log to the client aggregation library service 212. The client aggregation library service 212 can determine time periods during which the user never made or received a phone call, and provide the time periods to the mobile device 102.
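
The call log analysis can be illustrated with a sketch that finds hours of the day in which no call ever occurred. Representing the call log as (start, end) pairs and bucketing by whole hours are assumptions for the example.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def quiet_hours(call_log: Iterable[Tuple[datetime, datetime]]) -> List[int]:
    """Return hours of the day (0-23) in which no logged call took place."""
    busy = set()
    for start, end in call_log:
        # Mark every clock hour touched by the call, from its start to its end.
        cursor = start.replace(minute=0, second=0, microsecond=0)
        while cursor <= end:
            busy.add(cursor.hour)
            cursor += timedelta(hours=1)
    return [hour for hour in range(24) if hour not in busy]
```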

The data collection library 204 may call various features of the mobile device 102. These features can include applications installed on the mobile device 102, for example, WebView, automated page navigation software, and page download software. The data collection library 204 can use compression wherever necessary or possible. The features of the mobile device 102 called by the data collection library 204 can include system functions, e.g., push notifications, alarm notifications for time based scraping, secure password locker, local preference cache, battery, display, memory and CPU tracker, and APIs for detecting battery low notifications, display on and off detection, and application usage.

The mobile device 102 can take various approaches to scrape data. In some implementations, the mobile device 102 can put all transaction applications, e.g., banking and credit card management applications, in an emulator pod, e.g., an Android emulator pod. The mobile device 102 can scan the communication between mobile device 102 and a target site of a service provider, and dump data related to the transaction to a storage device. The mobile device 102 aggregates the dumped data later, for example, at a time of low CPU usage and no network communication. This backup scheme can be implemented in case service providers do not use security features that link service providers' applications to a phone and SIM card combination.

In some implementations, the mobile device 102 can use mobile application to mobile application communication to implement various scraping techniques. The application-to-application communication can enable a scraping mobile application 202 to communicate with an external mobile application instead of performing the scraping directly. The external application can perform the data collection, and provide collected data to the scraping mobile application 202. The scraping mobile application 202 can scan the data and retain the streams of the communication. The application-to-application communication approach can be implemented in cases where various operating systems, e.g., Android and iOS, allow application-to-application communication. To enable application-to-application communication, the mobile device 102 can be rooted, an operation that some users may have access to. The communication can be retained through TCP layer access. In order to make changes to the TCP layer, one can modify the core OS TCP stack. Modifying the TCP stack may require the mobile device 102 to be rooted.

In some implementations, the mobile device 102 is configured to log into a service provider's website daily to pull data in the form of email attachments. This approach utilizes various business-to-consumer (B2C) features provided by a service provider, for example, by using consumer-oriented features of the service provider.

FIG. 3 is a block diagram illustrating an example flow in distributed data scraping. A data aggregation server 106, which includes one or more computer processors, is configured to receive requests from various applications 202 and 302 to aggregate various account data, and provide the aggregated data to the applications 202 and 302. Each of the applications 202 and 302 can be an application configured to process and present aggregated data. For example, applications 202 and 302 can include census applications or personal asset management applications, e.g., Yodlee® Personal Financial Management (PFM) applications.

The data aggregation server 106 can include, or be coupled to, a credentials database 304. The credentials database 304 can store user credentials for accessing one or more target sites of one or more respective service providers. The data aggregation server 106 can communicate with one or more aggregation agents 306. The data aggregation server 106 can inform each aggregation agent 306 of synchronization time and provide various tasks coordinating operations of aggregation agents 306. Each aggregation agent 306 can communicate with one or more target sites 308. A target site 308 can be a Web site, FTP site, or other site of a service provider customized for mobile devices. An aggregation agent 306 can request the target site 308 to send one or more data items to a user or to the data aggregation server 106, e.g., as one or more email attachments. The applications 202 and 302 can communicate with the one or more target sites 308 to collect data from the one or more target sites 308. The data aggregation server 106 and the target site 308 can communicate with a messaging system, e.g., an email server or an FTP server, to facilitate the submission of the data items.

FIG. 4 is a block diagram illustrating functional blocks of an example information extraction system. Each functional block, and each module of the functional block as described below, can be implemented by hardware, hardware and software, or hardware and firmware components of a mobile device or a data aggregation server, e.g., the mobile device 102 of FIG. 1, the data aggregation server 106 of FIG. 1, or both.

The mobile device or data aggregation server can implement a site management functional block 402. The site management functional block 402 is configured to add, manage, and delete one or more target sites, e.g., target sites 404 and 406, of service providers for scraping. The site management functional block 402 can store references to the added sites, e.g., links, in a target site data store, as well as user credentials of the added sites.

The mobile device or data aggregation server can implement a management features functional block 408. The management features functional block 408 is configured to add, manage, and delete various applications that consume the aggregated data. The applications can include, for example, a personal asset management service 410, a small business lending service 412, or a business or personal loan service. These services can manipulate the aggregated data and present results of various analyses of the aggregated data as one or more reports for display on user devices.

The mobile device or data aggregation server can implement a user module 414. The user module 414 is a component configured to receive user input to perform various operations. The user module 414 includes a connection module 416. The connection module 416 is a component configured to create a logical connection between an application, e.g., a personal asset management service, and a service provider. For example, the connection module 416 can receive a user input to add a particular bank account to the personal asset management service, and create a logical connection between the bank account and the personal asset management service. The user module 414 can include a trigger module 418. The trigger module 418 can accept a user input to specify that an action is performed in response to an event. For example, the trigger module 418 can create a trigger in response to a user input. The trigger can specify that, in response to an email message received by a user's mailbox from a particular service provider, an attachment of the email is forwarded to a data aggregation server.
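The event-and-action pairing handled by the trigger module 418 can be illustrated with a minimal Python sketch. The names `EmailMessage`, `Trigger`, `make_forwarding_trigger`, and `on_email_received`, as well as the example addresses, are hypothetical and are not part of the disclosure; forwarding is simulated by appending to a list that stands in for the data aggregation server.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EmailMessage:
    sender: str
    attachments: List[bytes] = field(default_factory=list)


@dataclass
class Trigger:
    """Pairs an event test with an action (hypothetical structure)."""
    matches: Callable[[EmailMessage], bool]
    action: Callable[[EmailMessage], None]


def make_forwarding_trigger(provider_address: str, outbox: List[bytes]) -> Trigger:
    # Event: a message arrives from the given service provider.
    # Action: forward each attachment; here the outbox list stands in
    # for an upload to the data aggregation server.
    return Trigger(
        matches=lambda msg: msg.sender == provider_address,
        action=lambda msg: outbox.extend(msg.attachments),
    )


def on_email_received(msg: EmailMessage, triggers: List[Trigger]) -> None:
    # Run the action of every trigger whose event test matches the message.
    for trigger in triggers:
        if trigger.matches(msg):
            trigger.action(msg)


forwarded: List[bytes] = []
triggers = [make_forwarding_trigger("statements@acme.example", forwarded)]
on_email_received(EmailMessage("statements@acme.example", [b"%PDF-statement"]), triggers)
on_email_received(EmailMessage("news@other.example", [b"ad"]), triggers)
```

In this sketch only the attachment from the registered service provider address reaches `forwarded`; the message from the other sender is ignored.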

The mobile device or data aggregation server can implement an aggregation module 420. The aggregation module 420 is a component configured to perform various data gathering operations. The aggregation module 420 includes a statement module 422. The statement module 422 is a component configured to collect account statements from service providers. For example, directly or through an agent, a mobile device or a data aggregation server can use the statement module 422 to log in to a target site and navigate to a “generate account statement” page. The “generate account statement” page can be a page configured to generate an account statement, e.g., a bank statement or credit card statement, for a specified period of time. The statement module 422 can enter a “from” date and a “to” date to define the period of time. The statement module 422 can scrape the generated account statement.

The aggregation module 420 includes a messaging module 424. The messaging module 424 is a component configured to present the statement to a user through messaging. For example, the messaging module 424 can email a statement obtained by the statement module 422 to a user's email inbox.

The aggregation module 420 includes a sharing module 426. The sharing module 426 is a component configured to share the statement with the applications, e.g., the personal asset management service 410, the small business lending service 412, or a business or personal loan service. The sharing can include retrieving the statements provided by the messaging module 424 and received by the data aggregation server, and submitting a representation of the statement through a previously created connection.

FIG. 5 is a block diagram of a conventional information extraction system 500 for reference. In a conventional system 500, a user can access the user's aggregated information through a personal application 502. The personal application 502 can execute on a mobile device. The personal application 502, instead of communicating with one or more target sites 504 directly, communicates with a data aggregator 506. The data aggregator 506 is coupled to a credentials database 508. The credentials database 508 stores login credentials of the user. The data aggregator 506 logs into the target sites 504, which are sites of service providers. The data aggregator 506 scrapes the data, processes the scraped data, and provides the processed data to the personal application 502 for consumption, e.g., for presentation to the user.

Multiple users may request aggregated data. In the conventional system 500, data scraping is centralized at the data aggregator 506. The data aggregator 506 maintains login information for all users. The scraping may not always be feasible, because the target sites 504 may implement security measures preventing the data aggregator 506 from logging in.

FIGS. 6A-6D illustrate example user interfaces of an information extraction system. The user interfaces can be presented on a mobile device, e.g., the mobile device 102 of FIG. 1.

FIG. 6A illustrates an example start page 602. The mobile device can launch an application, e.g., an account management application. The application causes a start page 602 to be displayed on a display surface, e.g., a touch screen, of the mobile device. The start page 602 can include a virtual button 604. The virtual button 604 is a user interface item that, upon receiving a touch input, causes the application to move to a next page.

FIG. 6B illustrates an example account page 606. The mobile device can display the account page 606 in response to a user input on the virtual button 604. The account page 606 can correspond to a particular service provider, e.g., “Acme Investments,” that has been previously registered with the application, e.g., through the site management functional block 402 of FIG. 4. For a first time access, the account page 606 can display a login screen, including a username field 608 for receiving a username, a password field 610 for receiving a password, a remember credential field 612 that, if checked, causes the application to store credentials for logging into the account, and a login virtual button 614 for logging into the account. In some implementations, the application remembers the credentials entered in the username field 608 and password field 610. The application can log into the account periodically, in the user-specified time mode, automatic time mode, or mixed time mode as discussed above. The application can then collect data from the service provider's site and store the collected data on the mobile device.

FIG. 6C illustrates an example summary page 616. Upon logging into an account, the application can retrieve various information related to the account. The information can include aggregated information, e.g., information from investment accounts and credit card accounts. The application can retrieve the information in real time or, in some implementations, scrape the information from the service provider's site, store the information on the mobile device or on a data aggregation server, and retrieve the information from the storage.

FIG. 6D illustrates an example projection page 618. In the projection page 618, the application can display data aggregated from various accounts registered with the application and enriched data that indicates projected balance in the future. The application can register, through the site management functional block 402 of FIG. 4, one or more streaming content subscription accounts, one or more credit card accounts and various purchase accounts in addition to the service provider (“Acme Investments”) account. By analyzing the history of scraped data, the application can determine that a user has been clearing the various accounts using the service provider (“Acme Investments”) account. The application can then display account information for those accounts, and their respective projected impact on the “Acme Investments” account in the projection page 618.
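The disclosure does not specify how a projected balance is computed. As one possible illustration only, the following Python sketch assumes that the projected impact of each linked account is the mean of its past payments, and that the projected balance is the current balance minus the sum of those impacts. The function names and the sample figures are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def projected_impacts(payment_history: Dict[str, List[float]]) -> Dict[str, float]:
    """Estimate the next-period payment per linked account.

    Assumption: the estimate is the mean of past payments; accounts
    with no history are skipped.
    """
    return {
        account: mean(payments)
        for account, payments in payment_history.items()
        if payments
    }


def projected_balance(current_balance: float, impacts: Dict[str, float]) -> float:
    # Subtract every projected payment from the paying account's balance.
    return current_balance - sum(impacts.values())


# Hypothetical scraped history of payments made from the paying account.
history = {
    "streaming subscription": [10.0, 10.0, 10.0],
    "credit card": [200.0, 250.0, 300.0],
}
impacts = projected_impacts(history)
balance = projected_balance(1000.0, impacts)
```

With these sample figures, the per-account impacts are 10.0 and 250.0, which an application could list next to the projected balance of the paying account.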

FIG. 7 is an example configuration user interface 702 of an information extraction system. The configuration user interface 702 can be presented on a mobile device, e.g., the mobile device 102 of FIG. 1. The configuration user interface 702 can accept user input for setting various parameters and preferences of scraping data. The configuration user interface 702 can be an interface for a scraping scheduler, which can be implemented by a data collection library, e.g., the data collection library 204 of FIG. 2.

The configuration user interface 702 includes a synchronization time setting section 704. The synchronization time setting section 704 includes one or more user interface items configured to receive user input for setting one or more time periods for collecting data from target sites, e.g., after wakeup or before sleep. After being turned on, the mobile device can, for example, display a notification at the corresponding time, i.e., after wakeup or before sleep. The mobile device can then receive a user confirmation to start collecting the data.

The configuration user interface 702 includes a network preference section 706. The network preference section 706 includes one or more user interface items configured to receive user input for setting network conditions for collecting data from target sites. The network conditions can include, for example, whether data collection occurs only over Wi-Fi connections, whether data collection occurs only over mobile data connections, or whether data collection can occur either over Wi-Fi connections or over mobile data connections.

The configuration user interface 702 includes a site preference section 708. The site preference section 708 includes one or more user interface items configured to receive user input for adding a target site, configuring a target site, or deleting a target site. Configuring a target site can include entering a site link and credentials for accessing the site.
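The settings collected by sections 704, 706, and 708 could be held in a single configuration object. The following Python sketch is one hypothetical representation; the class names, field names, string values, and the example link are assumptions made for illustration and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class NetworkPreference(Enum):
    WIFI_ONLY = "wifi_only"
    MOBILE_DATA_ONLY = "mobile_data_only"
    WIFI_OR_MOBILE_DATA = "wifi_or_mobile_data"


@dataclass
class TargetSite:
    link: str
    username: str
    password: str  # a real application would keep this in secure storage


@dataclass
class ScrapingConfig:
    sync_time: str = "before_sleep"  # or "after_wakeup"
    network: NetworkPreference = NetworkPreference.WIFI_ONLY
    sites: Dict[str, TargetSite] = field(default_factory=dict)

    def add_site(self, name: str, site: TargetSite) -> None:
        # Adding and configuring both store the link and credentials.
        self.sites[name] = site

    def delete_site(self, name: str) -> None:
        self.sites.pop(name, None)

    def network_allows(self, connection: str) -> bool:
        """Return True if scraping may run over 'wifi' or 'mobile_data'."""
        if self.network is NetworkPreference.WIFI_ONLY:
            return connection == "wifi"
        if self.network is NetworkPreference.MOBILE_DATA_ONLY:
            return connection == "mobile_data"
        return connection in ("wifi", "mobile_data")


config = ScrapingConfig()
config.add_site(
    "Acme Investments",
    TargetSite("https://acme.example/login", "alice", "secret"),
)
config.network = NetworkPreference.WIFI_OR_MOBILE_DATA
```

A scraping scheduler could call `network_allows` with the current connection type before starting a collection run.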

FIG. 8 is a flowchart illustrating a first example process 800 of information extraction by a mobile device. The mobile device can be the mobile device 102 of FIG. 1.

The mobile device receives (802) a request to aggregate transaction data from one or more transaction servers of a target site. The target site can be a Web site or a file repository, e.g., an FTP site. The target site corresponds to a service provider, e.g., a hospital, school, or bank.

The mobile device determines (804) whether a first condition is satisfied. The first condition can include a time, power, bandwidth or usage condition. The first condition can be a user specified condition. The first condition can be a time condition that specifies whether the mobile device can collect data in a particular time period, e.g., whether the mobile device should scrape the data after wakeup or before sleep. The time condition can be whether a pre-set scraping time, e.g., 1:00 am every day, has been reached. The first condition can be a bandwidth condition specifying that scraping shall occur when a Wi-Fi connection is present.

In response to determining that the first condition is satisfied, the mobile device scrapes (806) the transaction data from the one or more transaction servers, including providing respective user credentials to each transaction server and navigating a respective mobile web site of each transaction server.

The mobile device determines (808) whether a second condition is satisfied. The second condition, like the first condition, can include a time, power, bandwidth or usage condition. The second condition can be the same as, or different from, the first condition. For example, the second condition can be a time condition of whether a pre-set data submission time, e.g., 3:00 am every day, has been reached. Each of the first condition and second condition can be based on a user-specified timing mode, an automatic timing mode, or a mixed timing mode.

In response to determining that the second condition is satisfied, the mobile device provides (810) the scraped transaction data from the mobile device to a data aggregation server. The data aggregation server can include one or more computers of the data aggregation server 106 of FIG. 1.
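The two-condition flow of process 800 can be sketched as a small state holder that scrapes when the first condition holds and keeps the scraped data on the device until the second condition holds. The following Python sketch is illustrative only: the class `ScrapeThenSubmit`, its `tick` method, and the stand-in scrape and submit callables are hypothetical, and the 1:00 am and 3:00 am thresholds mirror the examples given above.

```python
from typing import Callable, List, Optional


class ScrapeThenSubmit:
    """Scrape when the first condition holds; submit when the second holds."""

    def __init__(
        self,
        first_condition: Callable[[], bool],
        second_condition: Callable[[], bool],
        scrape: Callable[[], List[dict]],
        submit: Callable[[List[dict]], None],
    ) -> None:
        self.first_condition = first_condition
        self.second_condition = second_condition
        self.scrape = scrape
        self.submit = submit
        self.pending: Optional[List[dict]] = None  # scraped, not yet submitted

    def tick(self) -> str:
        """Advance the process by one step and report what happened."""
        if self.pending is None:
            if not self.first_condition():   # corresponds to step 804
                return "waiting_to_scrape"
            self.pending = self.scrape()     # corresponds to step 806
            return "scraped"
        if not self.second_condition():      # corresponds to step 808
            return "waiting_to_submit"
        self.submit(self.pending)            # corresponds to step 810
        self.pending = None
        return "submitted"


state = {"hour": 0}
uploaded: List[dict] = []
process = ScrapeThenSubmit(
    first_condition=lambda: state["hour"] >= 1,   # scraping time reached
    second_condition=lambda: state["hour"] >= 3,  # submission time reached
    scrape=lambda: [{"amount": 42}],              # stand-in for real scraping
    submit=uploaded.extend,                       # stand-in for the upload
)

events = []
for hour in range(4):
    state["hour"] = hour
    events.append(process.tick())
```

Keeping the scraped data in `pending` between the two checks is what lets the scraping and the submission be scheduled independently.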

FIG. 9 is a flowchart illustrating a second example process 900 of information extraction by a mobile device. The process 900 can be performed by a mobile device including one or more processors, e.g., the mobile device 102 of FIG. 1.

The mobile device receives (902) a request to scrape account data. The mobile device can receive the request from a scraping scheduler of the mobile device. The scraping scheduler can be part of a scraping library, e.g., the data collection library 204 of FIG. 2. The mobile device can receive the request from a data aggregation server, which can be a computer of the data aggregation server 106 of FIG. 1 that includes one or more processors. The data aggregation server can send the request according to a schedule specified by a user of the mobile device.

The mobile device determines (904) whether a data scraping condition is satisfied. The data scraping condition can be specified on the mobile device. The data scraping condition includes at least one of a first time condition, a first power condition, a first bandwidth condition, or a first device usage condition. For example, the first time condition can specify whether the scraping should occur after wakeup or before sleep. The first bandwidth condition can specify that the scraping occurs when a Wi-Fi connection is present.
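A data scraping condition that combines optional time, power, bandwidth, and device usage sub-conditions could be checked as in the following Python sketch. The `DeviceState` and `ScrapingCondition` classes, their fields, and the sample values are hypothetical; the disclosure does not define concrete thresholds or how device state is obtained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceState:
    hour: int              # local hour of day, 0-23
    battery_percent: int
    on_wifi: bool
    in_use: bool


@dataclass
class ScrapingCondition:
    # A field left as None (or False) means that sub-condition is not imposed.
    allowed_hours: Optional[range] = None           # time condition
    min_battery_percent: Optional[int] = None       # power condition
    require_wifi: bool = False                      # bandwidth condition
    require_idle: bool = False                      # device usage condition

    def is_satisfied(self, state: DeviceState) -> bool:
        if self.allowed_hours is not None and state.hour not in self.allowed_hours:
            return False
        if (self.min_battery_percent is not None
                and state.battery_percent < self.min_battery_percent):
            return False
        if self.require_wifi and not state.on_wifi:
            return False
        if self.require_idle and state.in_use:
            return False
        return True


# Hypothetical "before sleep" window combined with a Wi-Fi requirement.
condition = ScrapingCondition(allowed_hours=range(22, 24), require_wifi=True)
night_ok = condition.is_satisfied(DeviceState(23, 80, True, False))
day_ok = condition.is_satisfied(DeviceState(14, 80, True, False))
no_wifi_ok = condition.is_satisfied(DeviceState(23, 80, False, False))
```

Only the sub-conditions the user actually sets are evaluated, so a condition made of a single time window or a single Wi-Fi requirement is handled by the same check.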

The mobile device determines (906), from a target database, one or more target sites from which to scrape the account data. The target sites can be entered by a user of the mobile device, or provided by a data aggregation program of the data aggregation server.

In response to determining that the data scraping condition is satisfied, the mobile device scrapes (908) the account data from the one or more target sites. Scraping the account data includes providing respective user credentials to each target site and navigating pages, e.g., pages of a respective mobile web site, of each target site. The mobile device then retrieves the account data from the pages.

The mobile device determines (910) whether a data submission condition is satisfied. The data submission condition can be specified on the mobile device. The data submission condition includes at least one of a second power condition, a second bandwidth condition, or a second device usage condition. Each condition, including the data scraping condition and the data submission condition, can be based on a user-specified timing mode, an automatic timing mode, or a mixed timing mode.

In response to determining that the data submission condition is satisfied, the mobile device provides (912) the scraped account data from the mobile device to the data aggregation server. The data aggregation server can aggregate the scraped data, including aggregating the account data scraped by the mobile device and other data scraped by the data aggregation server. The data aggregation server can generate an account report from the aggregated data, and provide the account report to the mobile device, or another user device, for storage, for presentation on a screen or for printout.
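The server-side aggregation of device-scraped and server-scraped data can be sketched as a merge followed by a per-account summary. The following Python sketch is illustrative only. It assumes each record is a dictionary with hypothetical `account`, `id`, and `amount` keys, and it assumes that a record seen from both sources should be counted once; the disclosure does not describe a record format or a de-duplication rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate(
    device_records: Iterable[dict],
    server_records: Iterable[dict],
) -> Dict[str, List[dict]]:
    """Merge both sources and group the records by account."""
    merged: Dict[Tuple[str, int], dict] = {}
    for record in list(device_records) + list(server_records):
        # Assumption: (account, id) identifies a record, so duplicates
        # scraped by both the device and the server are kept once.
        merged.setdefault((record["account"], record["id"]), record)
    by_account: Dict[str, List[dict]] = defaultdict(list)
    for record in merged.values():
        by_account[record["account"]].append(record)
    return dict(by_account)


def account_report(aggregated: Dict[str, List[dict]]) -> Dict[str, float]:
    # A very small "report": the net amount per account.
    return {
        account: sum(record["amount"] for record in records)
        for account, records in aggregated.items()
    }


device = [
    {"account": "bank", "id": 1, "amount": -20.0},
    {"account": "bank", "id": 2, "amount": 100.0},
]
server = [
    {"account": "bank", "id": 2, "amount": 100.0},  # also scraped by the device
    {"account": "card", "id": 7, "amount": -5.5},
]
report = account_report(aggregate(device, server))
```

The resulting `report` maps each account to its net amount and could then be provided to the mobile device or another user device for presentation.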

Example Mobile Device Architecture

FIG. 10 is a block diagram of an example architecture 1000 for a mobile device. A mobile device (e.g., the mobile device 102 of FIG. 1) can include memory interface 1002, one or more data processors, image processors and/or processors 1004, and peripherals interface 1006. Memory interface 1002, one or more processors 1004 and/or peripherals interface 1006 can be separate components or can be integrated in one or more integrated circuits. Processors 1004 can include application processors, baseband processors, and wireless processors. The various components in the mobile device, for example, can be coupled by one or more communication buses or signal lines.

Sensors, devices and subsystems can be coupled to peripherals interface 1006 to facilitate multiple functionalities. For example, motion sensor 1010, light sensor 1012 and proximity sensor 1014 can be coupled to peripherals interface 1006 to facilitate orientation, lighting and proximity functions of the mobile device. Location processor 1015 (e.g., GPS receiver) can be connected to peripherals interface 1006 to provide geopositioning. Electronic magnetometer 1016 (e.g., an integrated circuit chip) can also be connected to peripherals interface 1006 to provide data that can be used to determine the direction of magnetic North. Thus, electronic magnetometer 1016 can be used as an electronic compass. Motion sensor 1010 can include one or more accelerometers configured to determine change of speed and direction of movement of the mobile device. Barometer 1017 can include one or more devices connected to peripherals interface 1006 and configured to measure pressure of atmosphere around the mobile device.

Camera subsystem 1020 and an optical sensor 1022, e.g., a charge-coupled device (CCD) or a complementary metal-oxide semiconductor (CMOS) optical sensor, can be utilized to facilitate camera functions, such as recording photographs and video clips.

Communication functions can be facilitated through one or more wireless communication subsystems 1024, which can include radio frequency receivers and transmitters and/or optical (e.g., infrared) receivers and transmitters. The specific design and implementation of the communication subsystem 1024 can depend on the communication network(s) over which a mobile device is intended to operate. For example, a mobile device can include communication subsystems 1024 designed to operate over a GSM network, a GPRS network, an EDGE network, a Wi-Fi™ or WiMax™ network and a Bluetooth™ network. In particular, the wireless communication subsystems 1024 can include hosting protocols such that the mobile device can be configured as a base station for other wireless devices.

Audio subsystem 1026 can be coupled to a speaker 1028 and a microphone 1030 to facilitate voice-enabled functions, such as voice recognition, voice replication, digital recording, and telephony functions. Audio subsystem 1026 can be configured to receive voice commands from the user.

I/O subsystem 1040 can include touch surface controller 1042 and/or other input controller(s) 1044. Touch surface controller 1042 can be coupled to a touch surface 1046 or pad. Touch surface 1046 and touch surface controller 1042 can, for example, detect contact and movement or break thereof using any of a plurality of touch sensitivity technologies, including but not limited to capacitive, resistive, infrared, and surface acoustic wave technologies, as well as other proximity sensor arrays or other elements for determining one or more points of contact with touch surface 1046. Touch surface 1046 can include, for example, a touch screen.

Other input controller(s) 1044 can be coupled to other input/control devices 1048, such as one or more buttons, rocker switches, thumb-wheel, infrared port, USB port, and/or a pointer device such as a stylus. The one or more buttons (not shown) can include an up/down button for volume control of speaker 1028 and/or microphone 1030.

In one implementation, a pressing of the button for a first duration may disengage a lock of the touch surface 1046; and a pressing of the button for a second duration that is longer than the first duration may turn power to the mobile device on or off. The user may be able to customize a functionality of one or more of the buttons. The touch surface 1046 can, for example, also be used to implement virtual or soft buttons and/or a keyboard.

In some implementations, the mobile device 102 can present recorded audio and/or video files, such as MP3, AAC, and MPEG files. In some implementations, the mobile device can include the functionality of an MP3 player. Other input/output and control devices can also be used.

Memory interface 1002 can be coupled to memory 1050. Memory 1050 can include high-speed random access memory and/or non-volatile memory, such as one or more magnetic disk storage devices, one or more optical storage devices, and/or flash memory (e.g., NAND, NOR). Memory 1050 can store operating system 1052, such as Darwin, RTXC, LINUX, UNIX, OS X, WINDOWS, or an embedded operating system such as VxWorks. Operating system 1052 may include instructions for handling basic system services and for performing hardware dependent tasks. In some implementations, operating system 1052 can include a kernel (e.g., UNIX kernel).

Memory 1050 may also store communication instructions 1054 to facilitate communicating with one or more additional devices, one or more computers and/or one or more servers. Memory 1050 may include graphical user interface instructions 1056 to facilitate graphic user interface processing; sensor processing instructions 1058 to facilitate sensor-related processing and functions; phone instructions 1060 to facilitate phone-related processes and functions; electronic messaging instructions 1062 to facilitate electronic-messaging related processes and functions; web browsing instructions 1064 to facilitate web browsing-related processes and functions; media processing instructions 1066 to facilitate media processing-related processes and functions; GPS/Navigation instructions 1068 to facilitate GPS and navigation-related processes and instructions; camera instructions 1070 to facilitate camera-related processes and functions; magnetometer data 1072 and calibration instructions 1074 to facilitate magnetometer calibration. The memory 1050 may also store other software instructions (not shown), such as security instructions, web video instructions to facilitate web video-related processes and functions, and/or web shopping instructions to facilitate web shopping-related processes and functions. In some implementations, the media processing instructions 1066 are divided into audio processing instructions and video processing instructions to facilitate audio processing-related processes and functions and video processing-related processes and functions, respectively. An activation record and International Mobile Equipment Identity (IMEI) or similar hardware identifier can also be stored in memory 1050. Memory 1050 can store scraping instructions 1076 that, when executed, can cause processor 1004 to perform operations of data scraping, including executing example processes 800 and 900 as described above in reference to FIG. 8 and FIG. 9, respectively.

Each of the above identified instructions and applications can correspond to a set of instructions for performing one or more functions described above. These instructions need not be implemented as separate software programs, procedures or modules. Memory 1050 can include additional instructions or fewer instructions. Furthermore, various functions of the mobile device may be implemented in hardware and/or in software, including in one or more signal processing and/or application specific integrated circuits.

Example Operating Environment

FIG. 11 is a block diagram of an example network operating environment 1100 for the mobile devices of FIGS. 1-6. Mobile devices 1102 a and 1102 b can, for example, communicate over one or more wired and/or wireless networks 1110 in data communication. For example, a wireless network 1112, e.g., a cellular network, can communicate with a wide area network (WAN) 1114, such as the Internet, by use of a gateway 1116. Likewise, an access device 1118, such as an 802.11g wireless access point, can provide communication access to the wide area network 1114. Each of mobile devices 1102 a and 1102 b can be mobile device 102.

In some implementations, both voice and data communications can be established over wireless network 1112 and the access device 1118. For example, mobile device 1102 a can place and receive phone calls (e.g., using voice over Internet Protocol (VoIP) protocols), send and receive e-mail messages (e.g., using Post Office Protocol 3 (POP3)), and retrieve electronic documents and/or streams, such as web pages, photographs, and videos, over wireless network 1112, gateway 1116, and wide area network 1114 (e.g., using Transmission Control Protocol/Internet Protocol (TCP/IP) or User Datagram Protocol (UDP)). Likewise, in some implementations, the mobile device 1102 b can place and receive phone calls, send and receive e-mail messages, and retrieve electronic documents over the access device 1118 and the wide area network 1114. In some implementations, mobile device 1102 a or 1102 b can be physically connected to the access device 1118 using one or more cables and the access device 1118 can be a personal computer. In this configuration, mobile device 1102 a or 1102 b can be referred to as a “tethered” device.

Mobile devices 1102 a and 1102 b can also establish communications by other means. For example, wireless device 1102 a can communicate with other wireless devices, e.g., other mobile devices, cell phones, etc., over the wireless network 1112. Likewise, mobile devices 1102 a and 1102 b can establish peer-to-peer communications 1120, e.g., a personal area network, by use of one or more communication subsystems, such as the Bluetooth™ communication devices. Other communication protocols and topologies can also be implemented.

The mobile device 1102 a or 1102 b can, for example, communicate with one or more services 1130, 1140, and 1150 over the one or more wired and/or wireless networks. For example, one or more data aggregation services 1130 can provide aggregated service provider data to mobile devices 1102 a and 1102 b. Reporting service 1140 can provide aggregated service provider data to data analysis customers, e.g., research institutes. Transaction service 1150 can provide transaction data for aggregation.

Mobile device 1102 a or 1102 b can also access other data and content over the one or more wired and/or wireless networks. For example, content publishers, such as news sites, Really Simple Syndication (RSS) feeds, web sites, blogs, social networking sites, developer networks, etc., can be accessed by mobile device 1102 a or 1102 b. Such access can be provided by invocation of a web browsing function or application (e.g., a browser) in response to a user touching, for example, a Web object.

A number of implementations of the invention have been described. Nevertheless, it will be understood that various modifications can be made without departing from the spirit and scope of the invention.

What is claimed is:
 1. A system comprising: a data aggregation server configured to perform operations comprising receiving, for each user of a plurality of users, authorization to aggregate account data for a plurality of different accounts of the user, issuing, for each mobile device of a plurality of mobile devices with each mobile device associated with a different respective user of the plurality of users, a request to scrape account data for one or more of the different accounts of the respective user, receiving, from each mobile device of the plurality of mobile devices, the scraped account data for the one or more of the different accounts of the respective user for the mobile device, and generating, for each user of the plurality of users and using the scraped account data corresponding to the user, a report that includes aggregated account data for the plurality of different accounts of the user; and the plurality of mobile devices, wherein each mobile device is associated with a different respective user and is configured to perform operations comprising receiving, from the data aggregation server, a request to scrape account data for one or more different accounts of the respective user associated with the mobile device, determining, from a target database, a respective target site for each of the one or more different accounts of the respective user, in response to receiving the request, scraping the account data from the one or more target sites, including providing respective user credentials to each target site, navigating pages of each target site, and retrieving the account data from the pages, and providing the scraped account data for the one or more different accounts of the respective user to the data aggregation server.
 2. The system of claim 1, wherein a request to scrape account data for a plurality of different accounts of a particular user is issued to a particular mobile device associated with the particular user according to a schedule specified by the particular user.
 3. The system of claim 1, wherein providing the scraped account data for the one or more different accounts of the respective user to the data aggregation server comprises determining that a data submission condition is satisfied.
 4. The system of claim 3, wherein the data submission condition includes at least one of a time condition, a power condition, a bandwidth condition, or a device usage condition.
 5. The system of claim 3, wherein each data submission condition is specified on a respective mobile device of a respective user.
 6. The system of claim 1, wherein scraping the account data from the one or more target sites includes determining that a data scraping condition is satisfied.
 7. The system of claim 6, wherein the data scraping condition includes at least one of a time condition, a power condition, a bandwidth condition, or a device usage condition.
 8. The method of claim 6, wherein the data scraping condition specifies whether the scraping should occur after wakeup or before sleep.
 9. The method of claim 6, wherein the data scraping condition specifies that the scraping shall occur when a Wi-Fi connection is present.
 10. The method of claim 6, wherein each condition is based on a user specified timing mode, an automatic timing mode, and a mixed timing mode.
 11. The system of claim 6, wherein each data scraping condition is specified on a respective mobile device of a respective user.
 12. The system of claim 11, wherein the request is obtained by a scraping scheduler of the mobile device of the user.
 13. The system of claim 12, wherein the request is sent from the data aggregation server according to a schedule specified by the user of the mobile client device.
 14. The system of claim 1, wherein a content of the report corresponding to a particular user generated by the data aggregation server is specified by the particular user through a user interface provided by the data aggregation server.
 15. The system of claim 1, wherein security features of a particular target site prevent the data aggregation server from accessing the target site.
 16. The system of claim 1, further comprising an intermediate storage server configured to perform operations comprising: receiving scraped account data of a user at a first time; and providing the scraped account data of the users to an agent of the data aggregation server at a second time.
 17. The system of claim 1, wherein the data aggregation server is configured to perform operations further comprising: providing, for each user of the plurality of users, the generated report corresponding to the user to the mobile device associated with the user for presentation.